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Abstract 

We develop a fully discriminative learning approach for supervised Latent Dirich- 
let Allocation (LDA) model using Back Propagation (i.e., BP-sLDA), which max¬ 
imizes the posterior probability of the prediction variable given the input doc¬ 
ument. Different from traditional variational learning or Gibbs sampling ap¬ 
proaches, the proposed learning method applies (i) the mirror descent algorithm 
for maximum a posterior inference and (ii) back propagation over a deep architec¬ 
ture together with stochastic gradient/mirror descent for model parameter estima¬ 
tion, leading to scalable and end-to-end discriminative learning of the model. As 
a byproduct, we also apply this technique to develop a new learning method for 
the traditional unsupervised LDA model (i.e., BP-LDA). Experimental results on 
three real-world regression and classification tasks show that the proposed meth¬ 
ods significantly outperform the previous supervised topic models, neural net¬ 
works, and is on par with deep neural networks. 


1 Introduction 


Latent Dirichlet Allocation (LDA) Q, among various forms of topic models, is an important prob¬ 
abilistic generative model for analyzing large collections of text corpora. In LDA, each document is 
modeled as a collection of words, where each word is assumed to be generated from a certain topic 
drawn from a topic distribution. The topic distribution can be viewed as a latent representation of 
the document, which can be used as a feature for prediction purpose (e.g., sentiment analysis). In 
particular, the inferred topic distribution is fed into a separate classifier or regression model (e.g., 
logistic regression or linear regression) to perform prediction. Such a separate learning structure 
usually significantly restricts the performance of the algorithm. Eor this purpose, various super¬ 
vised topic models have been proposed to model the documents jointly with the label information. 
In 0, variational methods was applied to learn a supervised LDA (sLDA) model by maximizing 
the lower bound of the joint probability of the input data and the labels. The DiscLDA method 
developed in 1151 learns the transformation matrix from the latent topic representation to the out¬ 
put in a discriminative manner, while learning the topic to word distribution in a generative manner 
similar to the standard LDA. In f2E\ , max margin supervised topic models are developed for classi¬ 
fication and regression, which are trained by optimizing the sum of the variational bound for the log 
marginal likelihood and an additional term that characterizes the prediction margin. These methods 
successfully incorporate the information from both the input data and the labels, and showed better 
performance in prediction compared to the vanilla LDA model. 


One challenge in LDA is that the exact inference is intractable, i.e., the posterior distribution of the 
topics given the input document cannot be evaluated explicitly. Eor this reason, various approximate 
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Figure 1: Graphical representation of the supervised LDA model. Shaded nodes are observables. 


inference methods are proposed, such as variational learning and Gibbs sampling 0123 , 

for computing the approximate posterior distribution of the topics. In this paper, we will show that, 
although the full posterior probability of the topic distribution is difficult, its maximum a posteriori 
(MAP) inference, as a simplified problem, is a convex optimization problem when the Dirichlet pa¬ 
rameter satisfies certain conditions, which can be solved efficiently by the mirror descent algorithm 
(MDA) 1 2|18pT) . Indeed, Sontag and Roy [19) pointed out that the MAP inference problem of LDA 
in this situation is polynomial-time and can be solved by an exponentiated gradient method, which 
shares a same form as our mirror-descent algorithm with constant step-size. Nevertheless, different 
from 1191, which studied the inference problem alone, our focus in this paper is to integrate back 
propagation with mirror-descent algorithm to perform fully discriminative training of supervised 
topic models, as we proceed to explain below. 

Among the aforementioned methods, one training objective of the supervised LDA model is to max¬ 
imize the joint likelihood of the input and the output variables Q. Another variant is to maximize 
the sum of the log likelihood (or its variable bound) and a prediction margin 126 |. Moreover, 

the DiscLDA optimizes part of the model parameters by maximizing the marginal likelihood of the 
input variables, and optimizes the other part of the model parameters by maximizing the condi¬ 
tional likelihood. For this reason, DiscLDA is not a fully discriminative training of all the model 
parameters. In this paper, we propose a fully discriminative training of all the model parameters by 
maximizing the posterior probability of the output given the input document. We will show that the 
discriminative training can be performed in a principled manner by naturally integrating the back- 
propagation with the MDA-based exact MAP inference. To our best knowledge, this paper is the 
first work to perform a fully end-to-end discriminative training of supervised topic models. Dis¬ 
criminative training of generative model is widely used and usually outperforms standard generative 
training in prediction tasks ||3|3|l3[I51lll) ■ pointed out in Q, discriminative training increases 
the robustness against the mismatch between the generative model and the real data. Experimental 
results on three real-world tasks also show the superior performance of discriminative training. 

In addition to the aforementioned related studies on topic models there have been 

another stream of work that applied empirical risk minimization to graphical models such as Markov 
Random Field and nonnegative matrix factorization 110 |^|. Specifically, in pO) , an approximate 
inference algorithm, belief propagation, is used to compute the belief of the output variables, which 
is further fed into a decoder to produce the prediction. The approximate inference and the decoder 
are treated as an entire black-box decision rule, which is tuned jointly via back propagation. Our 
work is different from the above studies in that we use an MAP inference based on optimization 
theory to motivate the discriminative training from a principled probabilistic framework. 


2 Smoothed Supervised LDA Model 


We consider the smoothed supervised LDA model in Figure Let K be the number of topics, 
N be the number of words in each document, V be the vocabulary size, and D be the number of 
documents in the corpus. The generative process of the model in Figure[3can be described as: 


1. For each document d, choose the topic proportions according to a Dirichlet distribution: 
Sd ~ p{(^dW) = Dir(a), where a is a AT x 1 vector consisting of nonnegative components. 

2. Draw each column of a E x K matrix $ independently from an exchangeable Dirichlet 
distribution: (j)k ^ Dir(/3) (i.e., <!> ^ p($|/3)), where /3 > 0 is the smoothing parameter. 

3. To generate each word Wd,n' 
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(a) Choose a topic Zd,n ~ pizd,n\Sd) = Multinomial(0d)- 

(b) Choose a word Wd^n ~ p{'UJd,n\zd,n,^) = Multinoinial(0^_j ^). 

4. Choose the Cxi response vector; ~ p{yd\(), U, 7 ). 

(a) In regression, p{yd\0d, U, 7 ) = N{U 9d, 7 ~^), where U isaC x K matrix consisting 
of regression coefficients. 

(b) In multi-class classification, p{yd\9d, U, 7 ) = Multinomial(Softmax(7[/0d))j where 

the softmax function is defined as Softmax(a:)c = ^ ~ 1, ■ ■. ,C. 


Therefore, the entire model can be described by the following joint probability 

D 

p{^\l3) TT \p{yd\0d,U,'f) ■p{dd\a) ■ p{Wd,l:N\Zd,l:N,^) ■ PiZd.l:N\0d) 

Lv---/ 


d=l 


=p{yd,6d,Wd,i:N ,Zd.l-.N\^,U,a,-f) 


( 1 ) 


where Wd,i-.N Zd^i-.N denotes all the words and the associated topics, respectively, in the d-th 
document. Note that the model in Figure [T is slightly different from the one proposed in Q, where 
the response variable yd in Figure[T|is coupled with 0d instead of Zd.i-.N as in Q. Blei and Mcauliffe 
also pointed out this choice as an alternative in Q. This modification will lead to a differentiable 
end-to-end cost trainable by back propagation with superior prediction performance. 

To develop a fully discriminative training method for the model parameters $ and U, we follow the 
argument in |^, which states that the discriminative training is also equivalent to maximizing the 
joint likelihood of a new model family with an additional set of parameters: 


D 


D 


arg ma.xp{^\l3)p{^\P) Y[p{yd\wd.i-.N, a) (2) 








where p{wd,i:N\^:a) is obtained by marginalizing p{yd, Od,Wd,i-.N, Zd^i-.N\^,U, in Q and 
replace $ with The above problem ^ decouples into 


D 


arg max 

L 


arg max 
!• 


lnp($|/3) -f \np{yd\wd,i-.N, U, a, 7 ) 

d=l 
D 

lnp($|/3) + lnp('u;d_i:Ar|$, a) 


(3) 

(4) 


d=l 


which are the discriminative learning problem of supervised LDA (Eq. (|^), and the unsupervised 
learning problem of LDA (Eq. Q), respectively. We will show that both problems can be solved in 
a unified manner using a new MAP inference and back propagation. 


3 Maximum A Posterior (MAP) Inference 

We first consider the inference problem in the smoothed LDA model. Eor the supervised case, the 
main objective is to infer yd given the words Wd,i-N in each document d, i.e., computing 

p{yd\wd,i-N,^,U,a,-f) = / p{yd\dd,U,^)p{9d\wd,i-.N,^,a)d9d (5) 

dSd 

where the probability p{yd\9d, U,7) is known (e.g., multinomial or Gaussian for classification and 
regression problems — see Section The main challenge is to evaluate p{9d\wd,i:N, ‘h, a), i.e., 
infer the topic proportion given eacn document, which is also the important inference problem in 
the unsupervised LDA model. However, it is well known that the exact evaluation of the posterior 
probability p(0d|tyd,i:Ar, 4>, a) is intractable p] |5][^[T5||26||27) . Eor this reason, various approximate 
inference methods, such as variational inference |4lp| |15||26| and Gibbs sampling |[9lp7|, have been 

*We will represent all the multinomial variables by a one-hot vector that has a single component equal to 
one at the position determined by the multinomial variable and all other components being zero. 


3 













proposed to compute the approximate posterior probability. In this paper, we take an alternative ap¬ 
proach for inference; given each document d, we only seek a point (MAP) estimate of 6d, instead of 
its full (approximate) posterior probability. The major motivation is that, although the full posterior 
probability of dd is difficult, its MAP estimate, as a simplified problem, is more tractable (and it is a 
convex problem under certain conditions). Furthermore, with the MAP estimate of 6d, we can infer 
the prediction variable yd according to the following approximation from 0: 

p{yd\wd.i-.N,‘^,U,a,'y) = [piyd\dd,U,-f)] « 17,7) (6) 

where ^ denotes the conditional expectation with respect to 9d given Wd^i-.N, and the ex¬ 

pectation is sampled by the MAP estimate, Od\wd i n’ ^d given Wd,i-.N, defined as 

^d\wd,i.N = argiiiaxp(6»d|iUd.i:Ar,^',a,/3) (7) 

Od 

The approximation gets more precise when p{6d\wd,i-.N,^,<^, P) becomes more concentrated 
around 0^^ Experimental results on several real datasets (Section show that the approx¬ 
imation 0 provides excellent prediction performance. 

Using the Bayesian rule p{9d\wd,i-.N, a) = p{9d\a)p{wd^i-.N\0d, ^)/p{wdi-.N\^, a) and the fact 
that p{wd,i:N\^, ct) is independent of 9d, we obtain the equivalent form of 0 as 

^d|u;rf,i,N = arg max [\np{9d\a) + lnp{wd,i-.N\Od,^)] (8) 


where Vk = {9 € : 9j > 0, J2f=i = 1} denotes the {K — 1)-dimensional probability 

simplex, p{9d\a) is the Dirichlet distribution, and p{'Wd,i-.N\dd, ‘J?) can be computed by integrating 

p{wd,i:N, Zd,i-.N\0d, $) = nn=i p{wd,n\zd,n, ^)p{zd,n\dd) ovcT Zd,i:N, which leads to (derived in 
Sectionj^of the supplementary material) 

V . K .xd,. 

p{Wd,l:N\0d,^) = HE 0d,j^vj] =p{xd\9d,^) (9) 

v=i ^j=i 7 

where denotes the term frequency of the u-th word (in vocabulary) inside the d-th document, 
and Xd denotes the U-dimensional bag-of-words (BoW) vector of the d-th document. Note that 
p{wd,i:N\0d, *&) depends on Wd,i-N only via the BoW vector Xd, which is the sufficient statistics. 
Therefore, we use p{xd\9d^ $) and p(wd i-.N\0di interchangeably from now on. Substituting the 
expression of Dirichlet distribution and 0 into 0’ we get 

0d\wdi.N = arg miM [xj ln($6id) + (a - 1)^ IndJ 

’ Od^VK 

= arg min [ - ln($dd) - (a - 1)^ Ind^] (10) 


where we dropped the terms independent of 9d, and 1 denotes an all-one vector. Note that when 
a > 1 (a > 1), the optimization problem (fTOli is (strictly) convex and is non-convex otherwise. 


3.1 Mirror Descent Algorithm for MAP Inference 


An efficient approach to solving the constrained optimization problem ( [T0| ) is the mirror descent 
^orithm (MDA) with Bregman divergence chosen to be generalized Kullback-Leibler divergence 
| 0[T^[^ . Specihcally, let f{9d) denote the cost function in ( [T0| ), then the MDA updates the MAP 
estimate of 9d iteratively according to: 


9d,t. = arg min 

SdG-PN 


f{0d,i-i) + [Ve_j/(dd,r-i)]^(^d — 8d,i-i) + 


1 

Td,i 


'^{dd, 0d,t-l) 


( 11 ) 


9d/ denotes the estimate of 9d^t at the £-th iteration, Td^i denotes the step-size of MDA, and y) 
is the Bregman divergence chosen to be 'I'(a;, y) = x'^ \n.{x/y) — x + The argmin in 
can be solved in closed-form (see Sectionj^of the supplementary material) as 


0d,e = -pr ■ © exp Td/ 




Xd 


a — 1 


^0d4-l 0d,e-i 


7=1,...,L, 9d,0 = ^1 


( 12 ) 
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Figure 2: Layered deep architecture for computing p{yd\wd,i:Nj U, a, 7), where ()/() denotes 
element-wise division, © denotes Hadamard product, and exp() denotes element-wise exponential. 

where Cg is a normalization factor such that 9d^e adds up to one, © denotes Hadamard product, L is 
the number of MDA iterations, and the divisions in ( fTSl i are element-wise operations. Note that the 
recursion © naturally enforces each 9d^i to be on the probability simplex. The MDA step-size Td^i 
can be either constant, i.e., Td^ = T, or adaptive over iterations and samples, determined by line 
search (see Section]^ of the supplementary material). The computation complexity in ( [T^ is low 
since most computations are sparse matrix operations. For example, although by itself ^9d e-i in 
is a dense matrix multiplication, we only need to evaluate the elements of ^9d/-i at the posi¬ 
tions where the corresponding elements of Xd are nonzero, because all other elements of Xdj^9d.i-i 
is known to be zero. Overall, the computation complexity in each iteration of ( [T^ is 0(nTok • K), 
where nTok denotes the number of unique tokens in the document. In practice, we only use a small 
number of iterations, L, in ( [T^ and use 9d,L to approximate i-w ^o that ^ becomes 

p{yd\wd,i-.N,^,U,a,j) ^ piyd\dd,L,U,'y) (13) 

In summary, the inference of 9d and yd can be implemented by the layered architecture in Figure [2] 
where the top layer infers yd using ( [T^ and the MDA layers infer 9d iteratively using ( fTS] ). Figure^ 
also implies that the the MDA layers act as a feature extractor by generating the MAP estimate 9d,L 
for the output layer. Our end-to-end learning strategy developed in the next section jointly learns the 
model parameter U at the output layer and the model parameter $ at the feature extractor layers to 
maximize the posterior of the prediction variable given the input document. 


4 Learning by Mirror-Descent Back Propagation 


We now consider the supervised learning problem Q and the unsupervised learning problem Q, 
respectively, using the developed MDA-based MAP inference. We first consider the supervised 
learning problem. With ([T3|), the discriminative learning problem (|^ can be approximated by 


arg min 

$.[/ 


D 

lnp($|/3) -'^\np{yd\9d,L,U,j) 
d=l 


(14) 


which can be solved by stochastic mirror descent (SMD). Note that the cost function in ( [l4| ) depends 
on U explicitly through p{yd\9d,L,U,j), which can be computed directly from its dehnition in 
Sectioi^ On the other hand, the cost function in ( |T4l i depends on $ implicitly through 9d^L- From 
Figurel^we observe that 9d^L not only depends on $ explicitly (as indicated in the MDA block on 
the right-hand side of Figure S but also depends on $ implicitly via 9d,L-i, which in turn depends 
on $ both explicitly and impicitly (through 9d^L-2) and so on. That is, the dependency of the 
cost function on $ is in a layered manner. Therefore, we devise a back propagation procedure to 
efficiently compute its gradient with respect to $ according to the mirror-descent graph in Figure 
1^ which back propagate the error signal through the MDA blocks at different layers. The gradient 
formula and the implementation details of the learning algorithm can be found in Sections IM in 
the supplementary material. 


For the unsupervised learning problem Q, the gradient of lnp($|/3) with respect to T* assumes the 
same form as that of lnp($|/3). Moreover, it can be shown that the gradient of lnp(wd 4 :Ar|$, a, 7) 
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with respect $ can be expressed as (see Sectionj^of the supplementary material); 

'91np(wrf,|Ar|$,a) ^ | ^ inp(a;d|6>rf, #)| » -^\np{xd\0d,L,^) (15) 

where p(xd|6*d,$) assumes the same form as except $ is replaced by $. The expectation is 
evaluated with respect to the posterior probability p{0d\wdp:N, o;), and is sampled by the MAP 
estimate of 9d in step (a). 9d,L is an approximation of 0d|tiid i n computed via ( [T^ and Figurej^ 

5 Experiments 

5.1 Description of Datasets and Baselines 

We evaluated our proposed supervised learning (denoted as BP-sLDA) and unsupervised learning 
(denoted as BP-LDA) methods on three real-world datasets. The first dataset we use is a large-scale 
dataset built on Amazon movie reviews (AMR) The data set consists of 7.9 million movie 
reviews (1.48 billion words) from Amazon, written by 889,176 users, on a total of 253,059 movies. 
For text preprocessing we removed punctuations and lowercasing capital letters. A vocabulary of 
size 5,000 is built by selecting the most frequent words. (In another setup, we keep the full vocab¬ 
ulary of 701K.) Same as p4j , we shifted the review scores so that they have zero mean. The task 
is formulated as a regression problem, where we seek to predict the rating score using the text of 
the review. Second, we consider a multi-domain sentiment (MultiSent) classification task Q, which 
contains a total 342,104 reviews on 25 types of products, such as apparel, electronics, kitchen and 
housewares. The task is formulated as a binary classification problem to predict the polarity (posi¬ 
tive or negative) of each review. Likewise, we preprocessed the text by removing punctuations and 
lowercasing capital letters, and built a vocabulary of size 1,000 from the most frequent words. In ad¬ 
dition, we also conducted a second binary text classification experiment on a large-scale proprietary 
dataset for business-centric applications (1.2M documents and vocabulary size of 128K). 

The baseline algorithms we considered include Gibbs sampling (Gibbs-LDA) flT) , logistic/linear re¬ 
gression on bag-of-words, supervised-LDA (sLDA) Q, and MedLDA which are implemented 
either in C-H- or Java. And our proposed algorithms are implemented in C#j^ For BP-LDA and 
Gibbs-LDA, we first train the models in an unsupervised manner, and then generate per-document 
topic proportion 9d as their features in the inference steps, on top of which we train a linear (logistic) 
regression model on the regression (classification) tasks. 


5.2 Prediction Performance 


We first evaluate the prediction performance of our models and compare them with the traditional 
(supervised) topic models. Since the training of the baseline topic models takes much longer time 
than BP-sLDA and BP-LDA (see Figure]^, we compare their performance on two smaller datasets, 
namely a subset (79K documents) of AMR (randomly sampled from the 7.9 million reviews) and the 
MultiSent dataset (342K documents), which are all evaluated with 5-fold cross validation. For AMR 
regression, we use the predictive to measure the prediction performance, defined as: pR^ = 
1 — {J2d(yd ~ Vd)^)KThdiVd ~ where y'^ denotes the label of the d-th document in the 

heldout (out-of-fold) set during the 5-fold cross validation, is the mean of all y'^ in the heldout 
set, and yd is the predicted value. The pR^ scores of different models with varying number of topics 
are shown in Figure 3(a) Note that the BP-sLDA model outperforms the other baselines with large 
margin. Moreover, the unsupervised BP-LDA model outperforms the unsupervised LDA model 
trained by Gibbs sampling (Gibbs-LDA). Second, on the MultiSent binary classification task, we 
use the area-under-the-curve (AUC) of the operating curve of probability of correct positive versus 
probability of false positive as our performance metric, which are shown in Figure [3(b)| It also shows 
that BP-sLDA outperforms other methods and that BP-LDA outperforms the Gibbs-LDA model. 


Next, we compare our BP-sLDA model with other strong discriminative models (such as neural net¬ 
works) by conducting two large-scale experiments; (i) regression task on AMR full dataset (7.9M 
documents) and (ii) binary classification task on the proprietary business-centric dataset (L2M doc¬ 
uments). For the large-scale AMR regression, we can see that pR^ improves significantly compared 

^The code will be released soon. 
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Number of topics 

(a) AMR regression task (79K) 


Number of topics 

(b) MultiSent classification task 


Number of topics 

(c) MultiSent task (zoom in) 


Figure 3: Prediction performance on AMR regression task (measured in pB?) and MultiSent classi¬ 
fication task (measured in AUC). Higher score is better for both, with perfect value being one. 


Table 1; pR^ (in percentage) on full AMR data (7.9M documents). The standard deviations in the 
parentheses are obtained from 5-fold cross validation. 


Number of topics 

5 1 10 1 20 1 50 1 100 1 200 

Linear Regression (voc5K) 

38.4 (0.1) 

Neural Network (voc5K) 

59.0(0.1) 

61.0 (0.1) 

62.3(0.4) 

63.5(0.7) 

63.1 (0.8) 

63.5(0.4) 

BP-sLDA (a = 1.001, voc5K) 

61.4 (0.1) 

65.3(0.3) 

69.1 (0.2) 

74.7(0.3) 

74.3 (2.4) 

78.3 (1.1) 

BP-sLDA (a = 0.5, voc5K) 

54.7(0.1) 

54.5(1.2) 

57.0(0.2) 

61.3(0.3) 

67.1 (0.1) 

74.5(0.2) 

BP-sLDA (a = 0.1, voc5K) 

53.3(2.8) 

56.1 (0.1) 

58.4(0.1) 

64.1(0.1) 

70.6 (0.3) 

75.7(0.2) 

Linear Regression (voc701K) 

41.5 (0.2) 

BP-sLDA (a=1.001,voc701K) 

69.8(0.2)1 74.3(0.3)1 78.5 (0.2) | 83.6 (0.6) | 80.1 (0.9) | 84.7(2.8) 


to the best results on the 79K dataset shown in Figure [3(a)l and also significantly outperform the neu¬ 
ral network models with same number of model parameters. Moreover, the best deep neural network 
(200 X 200 in hidden layers) gives pR^ of 76.2%(±0.6%), which is worse than 78.3% of BP-sLDA. 
In addition, BP-sLDA also significantly outperforms Gibbs-sLD^27|, Spectral-sLDA | [24| , and 
the Hybrid method (Gibbs-sLDA initialized with Spectral-sLDA) ||24|7whose pR? scores (reported 
in p^ ) are between 10% and 20% for 5 ^ 10 topics (and deteriorate when further increasing the 
topic number). The results therein are obtained under same setting as this paper. To further demon¬ 
strate the superior performance of BP-sLDA on the large vocabulary scenario, we trained BP-sLDA 
on full vocabulary (701K) AMR and show the results in Tablewhich are even better than the 5K 
vocabulary case. Finally, for the binary text classification task on the proprietary dataset, the AUCs 
are given in Table where BP-sLDA (200 topics) achieves 31% and 18% relative improvements 
over logistic regression and neural network, respectively. Moreover, on this task, BP-sLDA is also 
on par with the best DNN (a larger model consisting of 200 x 200 hidden units with dropout), which 
achieves an AUC of 93.60. 


5.3 Analysis and Discussion 


We now ana lyze t he influence of different hyper parameters on the prediction performance. Note 
from Figure 3(a) that, when we increase the number of topics, the pR^ score of BP-sLDA first 


improves and then slightly deteriorates after it goes beyond 20 topics. This is most likely to be 
caused by overfitting on the small dataset (79K documents), because the BP-sLDA models trained 
on the full 7.9M dataset produce much higher pR^ scores (Table than that on the 79K dataset 
and keep improving as the model size (number of topics) increases. To und erstan d the influence 


of the mirror descent steps on the prediction performance, we plot in Figure 4(a) the pR'^ scores 


of BP-sLDA on the 7.9M AMR dataset for different values of mirror-descent steps L. When L 
increases, for small models {K = 5 and K — 20), the pR^ score remains the same, and, for a larger 
model {K = 100), the pR^ score first improves and then remain the same. One explanation for 
this phenomena is that larger K implies that the inference problem ( [TOl i becomes an optimization 
problem of higher dimension, which requires more mirror descent iterations. Moreover, the mirror- 
descent back propagation, as an end-to-end training of the prediction output, would compensate 
the imperfection caused by the limited number of inference steps, which makes the performance 
insensitive to L once it is large enough. In Figure 4(b) we plot the percentage of the dominant 
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Table 2: AUC (in percentage) on the business-centric proprietary data (1.2M documents, 128K vo¬ 
cabulary). The standard deviations in the parentheses are obtained from five random initializations. 


Number of topics 

5 

10 

20 

50 

100 

200 

Logistic Regression 

90.56 (0.00) 

Neural Network 

90.95 (0.07) 

91.25 (0.05) 

91.32 (0.23) 

91.54 (0.11) 

91.90 (0.05) 

91.98 (0.05) 

BP-sLDA 

92.02 (0.02) 

92.21 (0.03) 

92.35 (0.07) 

92.58 (0.03) 

92.82 (0.07) 

93.50 (0.06) 



Number of mirror descent iterations (layers) 



■g 12 
o 

'o 10 

CT 8 


^■BP-LDA(a»1.001) 

I lBP-LDA(a=0.5) 
□□BP-LDA (0=0.1) 
^■Gibbs-LDA (a=0.5) 
^■Gibbs-LDA (a=0.1) 


Number of topics 


(a) Influence of MDA iterations L (b) Sparsity of the topic distribution 


(c) Per-word log-likelihoods 


Figure 4; Analysis of the behaviors of BP-sLDA and BP-LDA models. 


topics (which add up to 90% probability) on AMR, which shows that BP-sLDA learns sparse topic 
distribution even when a = 1.001 and obtains sparser topic distribution with smaller a (i.e., 0.5 and 
0.1). In Figure [4(c)] we evaluate the per-word log-likelihoods of the unsupervised models on AMR 
dataset using the method in The per-word log-likelihood of BP-LDA with a = 1.001 is worse 
than the case of a = 0.5 and a = 0.1 for Gibbs-LDA, although its prediction performance is better. 
This suggests the importance of the Dirichlet prior in text modeling Em and a potential tradeoff 
between the text modeling performance and the prediction performance. 


5.4 Efficiency in Computation Time 
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- 
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To compare the efficiency of the algorithms, we 
show the training time of different models on the 
AMR dataset (79K and 7.9M) in Figure which 
shows that our algorithm scales well with respect 
to increasing model size (number of topics) and in¬ 
creasing number of data samples. 

6 Conclusion 

We have developed novel learning approaches for 
supervised LDA models, using MAP inference and 
mirror-descent back propagation, which leads to an 
end-to-end discriminative training. We evaluate the 
prediction performance of the model on three real- 
world regression and classification tasks. The re¬ 
sults show that the discriminative training signifi¬ 
cantly improves the performance of the supervised 
LDA model relative to previous learning methods. Future works include (i) exploring faster algo¬ 
rithms for the MAP inference (e.g., accelerated mirror descent), (ii) developing semi-supervised 
learning of LDA using the framework from Q, and (iii) learning a from data. Finally, also note that 
the layered architecture in Figurej^could be viewed as a deep feedforward neural network | [TT| with 
structures designed from the topic model in Figure [T] This opens up a new direction of combining 
the strength of both generative models and neural networks to develop new deep learning models 
that are scalable, interpretable and having high prediction performance for text understanding and 
information retrieval |T3). 
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Figure 5: Training time on the AMR dataset. 
(Tested on Intel Xeon E5-2680 2.80GHz.) 
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Supplementary Material for “End-to-end Learning of EDA by 
Mirror-Descent Back Propagation over a Deep Architecture” 


A Derivation of p(wd,i;Af $) 


To derive p{wd,i:N\dd, ^), we first write p{wd,i:N, Zd,i:N\0d, $) as 


N 

P{wd,l-N, Zd,l-.N\dd, = II Pi'^d,n\zd,n, ^)p{zd,n\0d) (16) 

n—1 

The expression p{wd^i:N\(^d, ‘1’) can be evaluated in closed-form by marginalizing out {zd^n}n=i 1*^ 
the above expression: 


N 


P{Wd,l:N\Sd,^) = E En P{Zd,n\0d) ■ P{Wd,n\zd,n,^) 

Zd,i Zd,N n—1 
N 

= nE PiZd,„\0d) ■ piWd,„\Zd,n,^) 


n—1 Zd,i 


N 


K 


V K 


=nE n''S"' nn» 


Zd,n,j ^dp. 

vj 


n—lZd,n \j = l 




N 


y K 


=nE nn‘'£"’« 

n—lzd,n = l 


^d,nj ^dp. 

vj 


NIK 

= n i; 

"=i \i=i 

V^l \j=l 


VJ 


VJ 


(17) 


where Wd,n,v denotes the w-th element of the 1^ x 1 one-hot vector Wd,n, Wd,n denotes the n-th 
word (token) inside the d-th document, and Xd,v denotes the term frequency of the u-th word (in the 
vocabulary) inside the d-th document. 


B Derivation of the Recursion for Mirror Descent Algorithm 

First, we rewrite the optimization problem o as 

nain [y{Sd — dd,e-i) + — ^{^d, Sd,i-i) 

Sd ld,l 

s.t. ydd = l, Od^Q 


(18) 

(19) 


where Od h 0 denotes that each element of the vector 6d is greater than or equal to zero. Using 
the fact that T'(a;, y) = x’^ \n{x/y) — l^a; -f l^y, the constrained optimization problem ([TSll-lfT^ 
becomes 


min [^eJ{0d,e-i)fidd-0d,e-i) + Tf^ 
6d J-dJ 


ej\n- 


Od 


— t^0d + OdP-1 


Tdl-l 


s.t. t^9d = l, 9d>Q 

Dropping the terms independent of 9d, we can write @-(l^ as 


min [VeJ{9d,i-i)f9d + — 

Idl 


Ojln^ 

^d,l-l 


-t^9d 


( 20 ) 

( 21 ) 

( 22 ) 
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( 23 ) 


s.t. 1^0d = l, Od>Q 
To solve ([22li-([2T|, we write its Lagrangian as 

^d,e 


9i In- 


Od 


9d,e-i 



+ X{1^9d - 1 ) 


(24) 


where we relaxed the nonnegative constraint in the above Lagrange multiplier. However, we will 
show that the solution obtained will automatically be nonnegative mainly because of the logarithm 
term in the cost function. Taking the derivative of L with respect to 9^ and A and setting them to 
zero, we have, respectively. 


dL 

Ml 

dL 

1)\ 


^9j{9d,e-i) 


1 

Td,e 



+ A1 = 0 


1^9d -1 = 0 


which leads to 


9d,i-i 0 exp {—Td^e ■ Ve^/(0d^f-i)) 
exp{Td,i ■ A) 


t'^9d = 1 


Solving the above two equations together, we obtain 

dd = -^9d,e-i QiexTp{—Td/ ■ \7g^f{9d,i-i)) (25) 

where Cg is a normalization factor such that 9d,e adds up to one. Note that the above recursion can 
always guarantee non-negativity of the entries in the vector 9d,e since we will always initialize the 
vector in the feasible region. Recall that f{9d) is the cost function on the right-hand side of ( fTOl i, 
which is given by 

f{9d) = -Xd ln($6'd) - (a - M \n9d 
Therefore, the gradient of f{9d) can be computed as 

= (26) 

Substituting the above gradient formula into (|25]l, we obtain the desired result in ([T^. 


C Implementation Details of the BP-sLDA 

In this section, we describe the implementation details of the mirror-descent back propagation for 
the end-to-end learning of the supervised LDA model. Specifically, we will describe the details of 
the inference algorithm, and the model parameter estimation algorithm. 


C.l Inference algorithm: Mirror Descent 


Let f{9d) denote the objective function in ( [I^ . As we discussed in the paper, we use recursion ( [T^ 
to iteratively find the MAP estimate of 9d given Wd,i:N, which we repeat below: 


9d,t 


— ■ 9d,i-i © exp 
^g 



^9d^e-i 


a — 1 

9d,e-i_ 




9d,o = 


(28) 


The step-size Td^e in mirror descent can be chosen to be either constant, i.e., Td^e = T, or adaptive 
over iterations i and documents d. To adaptively determine the step-size, we can use line search 
procedure. The inference algorithm with a simple line search can be implemented as Algorithmic 
where 9d/-i) can also be replaced by the squared vector 1-norm: 


f{9d,e) < f{9d.e-i) + [^edf{9d,e-i)]'^{9d,£ - 9d,£-i) -f — — \\9d,£ - 9d,i-i\\\ (29) 

^dd,l 


The line search approach determines the step-sizes adaptively, automatically stabilizing the algo¬ 
rithm and making inference converge faster. Moreover, the unsupervised model (BP-LDA) uses the 
same form of inference algorithm except that T* is replaced with $ and is no longer needed. 
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d,^-l 


Algorithm 1 MAP Inference for BP-sLDA: MiiTor-Descent with Line Search 

1 

2 

3 

4 

5 

6 

7 

8 
9 

10 
11 
12 

13 

14 


— 0d,e-i) + j^'^{9d/,0d,e-i) then 


Initialization: 0^,0 = Td^. 

for £ = 1,..., L do 

Td,e = Td,i-i/r], where 0 < 77 < 1 (e.g., 77 = 0.5). 

while 1 do 

0d,e = ^ ■ Od,i-i © exp (Td,e 
a f{0d,i) > fi0d,e-i] 

Td,e ^ V • Td,i 

else 

break 

end if 
end while 
end for 

Inference result of Od'. 0d,L- 
Inference result of yd- 

( \0 U \ regression 

P Ud d,L^ ,7 |_Softmax(7t70c;) classification 


(27) 


C.2 Parameter Estimation: Stochastic Gradient Descent with Back Propagation 

We hrst rewrite the training cost d as 

D 

J(C/,$) =^Qd(t/,$) (30) 

d^l 

where Qd{') denotes the loss function at the d-ih document, defined as 

Qd(C7, $) = -^lnp($|/3) - lnp(j/d| 6 'd.L, C/, 7 ) (31) 

Note that, we do not have constraint on the model parameter U. Therefore, to update U, we can 
directly use the standard mini-batch stochastic gradient descent (SGD) algorithm. On the other 
hand, each column of the model parameter $ is constrained to be on a (1/— l)-dimension probability 
simplex, i.e, each element of <I> has to be nonnegative and each column sum up to one (i.e., $ is a left- 
stochastic matrix). For this reason, we use stochastic mirror descent (SMD) to update each column 
of the model parameter $, which is akin to the mirror descent algorithm for inference except that 
the gradient is replaced by stochastic gradient. The parameter estimation (learning) algorithm is 
described in Algorithm|^ where the expressions for the stochastic gradients and are given 
in the next section. Note that we are allowing different columns of <I> to have different (and adaptive) 
learning rate, which makes the learning algorithm converge faster. This design is also akin to the 
construction in AdaGrad Q. Finally, we also apply running average to the model parameters during 
SGD and SMD, which could improve the learning performance. In practical implementation, we 
could start the running average after after several passes of the training data. 

D Gradient Formula of BP-sLDA 

In this section, we give the gradient formula for the supervised learning of BP-sLDA. To this end, 
we hrst rewrite the training cost d as 

D 

J(C/,$) =^Qd((/,$) (35) 

d^l 

where Qd{-) denotes the loss function at the d-ih document, defined as 

Qd{U,^) = -^lnp($|/3) -\np{yd\0d,L,U,'y) (36) 
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Algorithm 2 Parameter Estimation for BP-sLDA; Stochastic Mirror Descent. 

1: for t = 1,2,... until converge do 

2: Sample a mini-batch of documents, denoted by Vt- 

3: Infer and 9^, using Algorithm [T] for each document d GVt. 

4: Compute the stochastic gradient for d GVt according to (|4gi. 

5: Compute the stochastic gradient dQd/d^ for d GVt according to Algorithm]^ 

6: Compute the averaged stochastic gradient over Vt: 


AUt = 


1 

W\ 


E 

d^'Dt 


dQd 

dU 


= 


1 

W\ 


E 

d^T)t 


dQd 




where Ut-i and denote the estimates of U and <!> up to mini-batch t — 1. 
1: Update U:Ut = Ut-i - • AC/*. 

8: for each column of $, j = 1,..., AT do 

9: Set learning rate: = A^o/ (^^/fpEEJaE^ + 

10: Update 

4'j,t = © exp ■ Apjd) 

where Crf,- ^ is a normalization factor that makes add up to one. 

11: end for 

12: Performing running average of the model parameters: 

Ut = + ]ut 

.St t-1- 1 

+ -^>4 

13: end for 

14: At convergence, Ut and will be final model parameters. 


(32) 


(33) 

(34) 


The expressions for the two terms in ( |3^ are given by 

K K V 


-^lnp($|/3) = -^In 


(Rf) nn* 


-1 

vj 


3 = 1 «=1 


K y 


= — — ^ E^'^ ~ constant 


3 = 1 «=i 
V 

- E : 


■ Po,d,j) 


-lnp(yd|0d,L,C/,7) = <1 ^i=i ’ E„^=lexp(7•Po.d.„^) 


27 


l?/d-Po.d|l 2 +constant 


classification 


regression 


(37) 


c c 

~ E + In E cxp(7 • Po,d,m) classification 

j—1 m—1 

7 ;-\\yd - Po,d \\2 + constant regression 

. 27 

where C in the above expressions is the number of output classes (in classification case), and 


(38) 


Po,d = UOd^L 


(39) 
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Algorithm 3 MiiTor-Descent Back Propagation for BP-sLDA 


Initialization of the error signal: ^d,L = —{I — ' liUd — Vd) 

for f = L,..., 1 do 




Qd,iQ^d,e. 

Od,i-l 


— Td^e- 
T 




.} 


{&d,eQ^d,e)^ 


A^d,e = Td^e ■ © ^d,eV - © U,i) © 

end for 

Compute the stochastic gradient dQd/d^ according to: 

L 


^dJ-l 


dQd 

3$ 


1 ^ 
D ' $ 


A^d,(. 


(42) 


r=i 


Note that the choice of p{yd\0d,L:U,^) is not restricted to the above two options in our frame¬ 
work. Other forms could also be used and the corresponding gradient formula could also be derived. 
However, in sequel, we will only derive the gradient formula for these two classical choices. 


D.l Gradient with respect to U 

First, we derive the gradient of Qd{‘) with respect U. Note that the only term in ( [36l l depending on 

U K\np{yd\9d,L,U,-i). Therefore, we have a(5d/9?7 = -d\np(jjd\9d,L,U,-i)/dU. Taking the 

gradient of ( [38l l with respect to U and after some simple algebra, we get 

dQd 
dU 

where ijd is defined as 

rSoftmax (7 • po d), classification 
yd = < 

lPo,d, regression 

JSoftmax (7 • U9d,L), classification 
\U0d,L, regression 


-7 • iVd - yd)9d,L 
-7 • {yd-yd)0j^L 


classification 

regression 


(40) 


D.2 Gradient with respect to $ 

In this subsection, we summarize the gradient expression for dQd/d^ in Algorithm]^ where the 
derivation can be found in the next subsection. In Algorithm]^ Xd and yd are the input bag-of-words 
vector and the label for the d-th document. The quantities 9d^i and yd are obtained and stored during 
the inference step, and the mirror-descent step-size Td^ is the one determined by line-search in the 
inference step (see Algorithm[^. 

Similar to the inference in Algorithm[T] the above gradients can be computed efficiently by exploit¬ 
ing the sparsity of the vector Xd- For example, only the elements at the nonzero positions of Xd need 
to be computed for ^9d^i.-i and ^{9d^i © ^^ 7 ) since ^ and are known to be zero at 

these positions. Moreover, although (/3 — 1)/$ is a dense matrix operation, it is the same within one 
mini-batch and can therefore be computed only once over each mini-batch, which can significantly 
reduce the amount of computation. 


D.3 Derivation of the gradient with respect to $ 

In this subsection, we derive the gradient formula for $. Note from ( |3^ that, there are two terms 
that depend on $, and 

^ ^ (^-^lnp($|/3)^ + (43) 
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The first term depends on $ explicitly and its gradient can be evaluated as 

The second term, however, depends on $ implicitly through 9d,L- From Figure]^ we observe that 
6d.L not only depends on $ explicitly (as indicated in the MDA block on the nght-hand side of 
Figure!^ but also depends on T* implicitly via 9d,L-i, which in turn depends on $ both explicitly 
and im^icitly (through 0d,L-2) and so on. That is, the dependency of the cost function on $ is in 
a layered manner. For this reason, we need to apply chain rule to derive the its full gradient with 
respect to $, which we describe below. 

First, as we discussed above, each MDA block in Figurel^contains T*, and Qd{U, $) depends on the 
<1) appeared at different layers through 9d,L, ■ ■ ■, To derive the gradient formula, we first denote 
these <1) at different layers as > ‘ki, and introduce an auxiliary function Rd{U, <i>i,..., 

to represent — lJip{yd\(^d,L, U,^) with its $ “untied” across layers in Figure]^ Then, the original 
-\np{yd\9d,L,U,^) can be viewed as 

- lnp(j/d|0d.L, C/, 7) = RdiU,^,..., $) (45) 

where $i = • • • = = $. Therefore, we have 

±(^-lnpiyd\9d,L,U,^))=j2^^ 

e=i ’■ 

where dRdjd^i denotes the gradient of Rd{U, $ 1 ,..., 4)^) with respect to Therefore, we only 
need to compute the gradient dRd/d^i. 




(46) 


For simplicity of notation, we drop the subscript of d in 9d,e. And since 4) is untied across layers in 
the mirror descent recursion ( [T^ for the computation of Rd{U, 4?i,..., 4>i), we can rewrite ( [T^ as 

rp Xd rr 11 


ze = Tdf 


4> 


9i-] 


Pi = 9i-i © exp(zf) 

- JPL_ 

' t^Pi 

where Zi and pg are intermediate variables, and $ is replaced with 
dRd/d^i, it suffices to derive dRd /Note that 

dRd _ dpj dRd _ dpj „ 


where 


d^iji dpi d^iji 

. A dRd 


(47) 

(48) 

(49) 

To derive the gradient 

(50) 

(51) 


is an intermediate quantities that follows a backward recursion to be derived later. To proceed, we 
need to derive dp>\ jd^iji. 


dpj _ ^ 9exp(zJ) 




= eu © 

= Cl 0 

= Cl 0 


dzj 

d^i,jt 

dzJ 

d^£,ii 


■ diag(exp(z^)) 
© 1 exp{zj) 


= 9j_i © exp( 2 ;J) © 


dzj 

d^LH 


= Pi Q 


dzj 

d^i,, 


(52) 
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Then, we need to derive the expression for dzf 


dzf 

9$ 




= Td/ 
= Td^i 
= Td^ 


d 




d^i 




oT 


x'^ 


d rp 

eu<^j ■ 


d^LH 


■ diag ( 


Xd 


= Td,i ■ ■ diag 

= Td^l ■ i —[0£_l]i 


Xd 




+ 


■^d p 


d rp 


Xd 




Xd 


(53) 


-I j 


where Ci denotes the i-th natural basis vector in Euclidean space (i.e., the vector with the i-th element 
being one and all other element equal to zero), and Eji denotes a matrix whose (j, i)-th element is 
one and all other elements are zero. Substituting the above expression into (B^, we obtain 


^ T dzj 
= PJ © 

T 


^Pj _T 


d^eji 

= Td^i ■ Pt Q \ —[9t-i\i 


Xd 


-i)^ 




Xd 




(54) 


Therefore, 

dRd 


dpj 


= Td^e ■ PiQ \ — [di-i\i 


_ ^ _ e-'$ 


Xd 


= Td^e 
= Td,e 
= Td,e 
= Td/ 
= Tdi 


l-l\i 


Xd 


{pe © ej 6(, + 




{pe © eJ 6i + 


Xd 


Xd 

Xd 


-I 3 


Xd 


{Pl © ef)Si^ 
[Pi\i ■ [<5f]i| 
[pe]i ■ [St]i 

3 


-[9i]i 


Xd 


-1) 

Xd 




ej^i{pi-i © 5i) 


[^i{p£ © 5t)]^ 


Xd 

[Pf\i ■ [(5r]ij 


Xd 

[pt]i ■ [<5c]*j 



(55) 


Writing the above expressions into matrix form (derivative with respect $^), we obtain: 


dRd 

d^£ 


= Td, 


. / 

\ ^i0e- 


-{pi © h)'^ - 


^t{pi © ^e) © 


Xd 




( 56 ) 


Now we need to derive the recursion for computing 5i. By the definition of 5i in ([ST]), we have 


A dRd 
£-1 = 


dpi-i 
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( 57 ) 


_ d6j_^ dpj dRd 

dpi^i dei_i dpi 

deu dpJ 

dpi_i d6i_i ^ 

QQl o T 

To continue, we have to evaluate and ■ By (|47|l-(|49ll, we have 

dpj 


ddj_i . , T\ -.i/iT 9exp(zJ) 

= afcl ® * ' ® -isJr- 


= 7 © [1 exp(zj’)] + 10^1 © 
= diag(exp(z^)) + t9j_^ © 
= diag(exp(z^)) + t9j_^ © 


d9i_i dzi 
dzj 


dzj dej 
dzi 

diag(exp(2:£)) 

dzj 


© 1 exp( 2 :J) 
dz'^ 

= diag(exp(z^)) + 1 [9j_-^ © exp(zj)] © —^ 

Otii-i 


= diag(exp(z^)) + IpJ © 


dzj 

d9i_i 


To proceed, we need to derive the expression for 


dzj . 
d8e-i ■ 


dzj ^ 

= ^d,t 


d 


d f a — t 


d9i_^ 


= T, 


d,e ■ 


d9e_i \9J_^^J 


= Td/ ■ |-$fdiag 
= -Td,e ■ (^’fdiag 


diag 

Xd 


d9i-i \ 0£_i 
Xd 


i^j9i-ir 


{<^j9,-ir 

Xd 

{<^j9e-i)^ 


- diag 
+ diag 


$£ - diag 
a — t 


a — 1 

W7 


a — 1 


Substituting the above expression into ([58|), we get the expression for 


e-i 


d9i-i ' 


dpj 

d9f,_. 


= diag exp Td^i 


$ 


a — 1 


^i9i-i 9e-i 


T 


- Td,i ■ {tpj) © 
PI 


diag 


Xd 




= diag 
= i diag 


- Td,t ■ {tpl ) © 


9(.-i 


— Td,i 


diag 


diag 
Xd 


+ diag 
Xd 




+ diag 


a — 1 


+ diag 
a — 1 


a — 1 


^e-i 


oU 


(58) 


(59) 


diag(pf) 

(60) 


QQ 1 

To complete the derivation of the recursion ( [57] i, we need to derive QpJ~J^ , which is given by 

d9j_^ dpj_i 1 5/1 


dpi^i dpi^i t^pi-i dpi^i yi'^pi^i 


T 

Pi-1 = 


I - 19j_^ 
t^Pi-i 


(61) 


Expressions ( |57| |, ( |60l l and ( |6T] ) provide the complete backward recursion for Sg from i = L to 
£ = 1. Finally, to initialize the backward recursion, we need the expression for (5 l. By its definition. 
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we have 


5l = 


dRd 



dpL 



del 

dpi, 

i dRd 

dpL 

deL 

dpo,d 

del 


dRd 

dpL 

■ 

dpo,d 

1 

f^PL 

{I- 

10l)■ 


dRd 


where in the last step we substituted ( |6T] ). By ( |45] l and([38]l, we have 

dRd d 


dpo,d 

e have 

- lnp(yd|6»d.L,t/,7)) 


Therefore, 


( 62 ) 


= 


dpo,d dpo,d 

_ \-l ■ {Vd- Vd) classification 
1 -7 ■ (t/d - Vd) regression 


-(/ — 10^) • • 7 • (t/d — Vd) classification 




(63) 


(64) 


regression 


As a final remark, we found that in practical implementation p£ could be very large while could 
be small, which leads to potential numerical instability. To address this issue, we introduce the 
following new variable: 

id,i = • St (65) 

Then, the quantities pt, and 5i can be replaced with one variable ^d,i 7 and the backward recursion of 
5i can also be replaced with the backward recursion of ^d,i- Introducing = dRd/d^i and with 
some simple algebra, we obtain the back propagation and gradient expression for T* in Algorithm]^ 


E Gradient Formula of BP-LDA 

The unsupervised learning problem Q can be rewritten, equivalently, as minimizing the following 
cost function: 


D 


J(l>) = ^Qd($) 


d=l 


where Qd(^) is the loss function defined as 


( 66 ) 


Qd(4’) = lnp($|/3) - lnp('u;d,i:Ar|T>,a) 


Taking the gradient of both sides of we obtain 

dQd d f 1 


5$ \ D 

The first term in ( |68| l has already been derived in ( |44l i: 

^lnp($|/3) = 

(9$ <1) 


(67) 


(~ A lnp(l>|/3)) + ^ (- ^np{wd.i-.N\^, a)) (68) 


(69) 


where denotes elementwise division of the scalar /3 — 1 by the matrix $. We now proceed to 
derive the second term in 


d ~ 1 d ~ 

^lnp(wd,i:jv|$,a) = —-—r ■ -^p{wd,i-.N\<^, a) 

d<^> p(wd,i:Ar|$,a) 5$ 
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d 


piwd,l:N\^,a) 5$ 

1 

P{Wd,l:N\^,a) 

1 

piwd,i:N\^,a) 


d 


d 


p{wd.i,N,0d\^,Q)d6d 

dOd 


-7^p{wd,i:N,0d\^,a) 

9$ 


d 

\np{wd,i:N,dd\^,a) 

(9$ 


— Efl 


\np{wd,i:N,dd\^,a) 

9$ 

d 

\np{wd,i:N,dd\^,a) 

9$ 


d 


piwd,i-.N,dd\^,a)d9d 

p{wd,i,N,9d\‘^,a) 

■ -=- dOd 

piwd,i:N\^,a) 

■ pi9d\wd,i-.N,^,a)d0d 


-^\np{wd,i-.N,dd\^,a) 

9$ 


Using (|^, we rewrite \np{wd,i:N, 9d\^, ce) as 

^np{wd,i:N, 9d\^, a) = ^np{wd.i:N, 9d\^, a) 

= lnp{wd.i-.N\9d, $) + lnp{0d\a) 
= ^np{xd\0d, $) + lnp(6»d|a) 


(70) 


(71) 


Note that expression applies expectation after taking the gradient with respect to $. Therefore, 
the gradient of lnp(iy(i i:Ar, 0d\^, ct) inside the expectation of ( |70l l is taken by assuming that 9d is 
independent of Taking the gradient of both sides of ( |7T| l and using this fact, we obtain 

d d 

\np{wd,i-.N, a) = ^ lnp{xd\0d, $) (72) 

9$ 9<I) 

Substituting the above expression into (|70|), we obtain the desired result. 
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